
Fix race condition in GC #402

Merged

Besroy merged 1 commit into eBay:main from Besroy:fix_crash
Mar 24, 2026

Conversation

Contributor

@Besroy Besroy commented Mar 23, 2026

When GC resets move_to_chunk via purge_reserved_chunk(), stale repl_reqs may still exist and be cleaned up by background gc_repl_reqs(). This causes two race conditions:

  1. Stale rreq frees blk on NEW allocator after reset (wrong allocator)
  2. Stale rreq frees blk on OLD allocator during reset, accessing destroyed superblock and causing crash

Issue link: #401


codecov-commenter commented Mar 23, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.20%. Comparing base (1746bcc) to head (66c61c2).
⚠️ Report is 167 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #402      +/-   ##
==========================================
- Coverage   63.15%   54.20%   -8.96%     
==========================================
  Files          32       36       +4     
  Lines        1900     5267    +3367     
  Branches      204      656     +452     
==========================================
+ Hits         1200     2855    +1655     
- Misses        600     2114    +1514     
- Partials      100      298     +198     

☔ View full report in Codecov by Sentry.

Collaborator

@xiaoxichen xiaoxichen left a comment


lgtm

// NOTE: This introduces potential double-free risk with gc_repl_reqs() background thread.
// See https://github.com/eBay/HomeObject/issues/401.
auto hs_pg = m_hs_home_object->get_hs_pg(pg_id);
hs_pg->repl_dev_->clear_chunk_req(move_to_chunk);
Collaborator


Do we need similar handling for move_from_chunk?

Collaborator


Please move this change into purge_reserved_chunk, before vchunk->reset().

Contributor Author


Moved. Also changed chunk to vchunk->get_chunk_id() (IIUC this also returns the pchunk id, please correct me if I'm wrong).

Collaborator

@JacksonYao287 JacksonYao287 left a comment


Is it possible that a new rreq is created after clear_chunk_req and before vchunk->reset(), and still hits the same issue?

// NOTE: This introduces potential double-free risk with gc_repl_reqs() background thread.
// See https://github.com/eBay/HomeObject/issues/401.
auto hs_pg = m_hs_home_object->get_hs_pg(pg_id);
hs_pg->repl_dev_->clear_chunk_req(move_to_chunk);
Collaborator


Please move this change into purge_reserved_chunk, before vchunk->reset().

Contributor Author

Besroy commented Mar 24, 2026

Is it possible that a new rreq is created after clear_chunk_req and before vchunk->reset(), and still hits the same issue?

Good question. Since the chunk is already a move_to_chunk, i.e. it sits in the m_reserved_chunk_queue, will there still be new rreqs on this chunk? If I understand correctly, there are two types of write IO on a chunk: put and delete. A put needs an open shard; since there is no open shard on this chunk, it would first have to create one, and that creation is blocked in select_specific_chunk because chunk_state=GC. A delete does not occupy space, so no blks are distributed. If that's the case, there is currently no risk.

However, it would be good if you could also review the blob-path logic while sorting out the shard process, to see whether there are other potential issues with GC concurrency.

@Besroy Besroy force-pushed the fix_crash branch 3 times, most recently from 277fbe8 to e7df539 Compare March 24, 2026 03:23
When GC resets move_to_chunk via purge_reserved_chunk(), stale repl_reqs may still exist and be cleaned up by background gc_repl_reqs().
This causes two race conditions:
 1. Stale rreq frees blk on NEW allocator after reset (wrong allocator)
 2. Stale rreq frees blk on OLD allocator during reset, accessing destroyed superblock and causing crash
@Besroy Besroy merged commit 5cb5144 into eBay:main Mar 24, 2026
25 checks passed


4 participants